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Abstract 

Nonnegative matrix factorization (NMF) is a powerful technique for dimension re¬ 
duction, extracting latent factors and learning part-based representation. For large 
datasets, NMF performance depends on some major issues: fast algorithms, fully par¬ 
allel distributed feasibility and limited internal memory. This research aims to design a 
fast fully parallel and distributed algorithm using limited internal memory to reach high 
NMF performance for large datasets. In particular, we propose a flexible accelerated 
algorithm for NMF with all its Li L 2 regularized variants based on full decomposition, 
which is a combination of an anti-lopsided algorithm and a fast block coordinate de¬ 
scent algorithm. The proposed algorithm takes advantages of both these algorithms to 
achieve a linear convergence rate of 0(1 — ||q||^ in optimizing each factor matrix when 
hxing the other factor one in the sub-space of passive variables, where r is the number 
of latent components; where ^/r < IIQII 2 < r. In addition, the algorithm can exploit 
the data sparseness to run on large datasets with limited internal memory of machines. 
Furthermore, our experimental results are highly competitive with 7 state-of-the-art 
methods about three signihcant aspects of convergence, optimality and average of the 
iteration number. Therefore, the proposed algorithm is superior to fast block coordinate 
descent methods and accelerated methods. 
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1 Introduction 


Nonnegative matrix factorization (NMF) is a powerful technique widely used in applications 
of data mining, signal processing, computer vision, bioinformatics, etc. [Zhallb]. Funda¬ 
mentally, NMF has two main purposes. First, it reduces dimension of data making learning 
algorithms faster and more effective as they often work less effectively due to the curse of 
dimensionality [HV05]. Second, NMF helps extracting latent components and learning part- 
based representation, which are the signihcant distinction from other dimension reduction 
methods such as Principal Component Analysis (PCA), Independent Component Analysis 
(ICA), Vector Quantization (VQ), etc. This feature originates from transforming data into 
lower dimension of latent components and non-negativity constraints [DS04, Gill4, LS'''99]. 

In the last decade of fast development, there were remarkable milestones. The two hrst 
milestones in early days of the NMF historical development were its mathematical formula¬ 
tions as positive matrix factorization with Byzantine algorithms [PT94] and as parts-based 
representation with a simple effective algorithm [LS'^99]. The last decade has witnessed 
the rapid NMF development [Zhallb, WZ13]. Various works on NMF can be viewed 
in three major perspectives: variants of NMF, algorithms and applications. In partic¬ 
ular, variants of NMF are based on either divergence functions [SLOl, Zhalla], or con¬ 
straints [Hoy04, PMCK+Ob], or regularizations [ChoOS, LAW ■*’07]. Most NMF algorithms 
were developed along two main directions: geometric greedy algorithms [TKWBll] and it¬ 
erative multiplicative update algorithms. Although geometric greedy algorithms are usually 
fast, they are hard to trade off complexity, optimality, loss information and sparseness. 

More recently, it is well recognized that the most challenging problems in iterative mul¬ 
tiplicative update algorithms for NMF are fast learning, limited internal memory, parallel 
distributed computation, among others. In particular, fast learning is essential in learning 
NMF models from large datasets, and it is indeed difficult to carry out them when the num¬ 
ber of variables is very large. In addition, the limited internal memory is one of the most 
challenging requirements for big data [GWLT13], because data has been exploring rapidly 
while the internal memory of nodes is always limited. Finally, parallel and distributed com¬ 
putation makes NMF applications feasible for big data [LYF+IO]. 

To deal with these challenges, this work develops an accelerated algorithm for NMF 
and its Li L 2 regularized variants having several major advantages that are summarized in 
Table 1. In this paper, we contribute hve folders as follows: 

• NMF and its variants'. We fully decompose NMF and its Li L 2 regularized variants 
into non-negative quadratic programming problems. This decomposition makes the proposed 
algorithm flexible to adapt all Li L 2 regularized NMF in an unihed framework that can 
trade-off the quality of information loss, sparsity and smoothness. 

• Algortihm: We employ a combinational algorithm of an anti-lopsided algorithm and 
a fast block coordinate descent algorithm for non-negative quadratic programming. The 
algorithm reduces variable scaling problems to achieve linear convergence rate of (1 — 

in optimizing each factor matrix in the sub-space of passive variables, which is advanced to 
fast coordinate methods and accelerated methods in terms of efficiency as well as convergence 
rate. In addition, the size of optimization problem is reduced into r [r m,n), which is the 
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Table 1: Comparison Summary of NMF solvers 


Criteria 

Inexact 

Exact 

Accelerated 

MUR 

PrG 

Qn 

Nt 

AcS 

BIP 

FCD 

AcH 

Ne 

Alo 

Guaranteed Convergence 

X 

X 

X 

X 

X 

X 

X 

X 


IIQIb) 

Exploit Data Spareness 

X 

X 

X 

X 

X 

X 

/ 

/ 

X 

/ 

Limited Internal Memory 

0{mn + r{r + n + m 

) 

0{r{r + n)) 

Fully Parallel &; Distributed 

XXX 

X 

X X 

X 

X 

X 

/ 

Optimization Problem Size 

r(m, n) 

r 

r(n, m) 

(m, n) 

r(m, n) 

r 


/ means considered, and /means not considered 

< IIQIb < r, n X m is the data matrix size, r is the number of latent components 
(m, n) = max(m, n), and r(m, n) = r.max(m, n) 

Abbreviations: MUR: Multiplicative Update Rule [LS"*'99]; PrG: Projected Gradient meth¬ 
ods [Lin07b]; Nt: Newton-type methods [KSD07]; Qn: Projected Quasi-Newton [ZC06]; AcS: Fast 
Active-set-like method [KPOSa]; BIP: Block Principal Pivoting method [KPOSb]; FCD: Fast Coor¬ 
dinate Descent methods with variable selection [HDll]; AcH: Accelerated Hierarchical Alternating 
Least Squares [GG12]; Nev: Nesterov’s optimal gradient method [GTLY12]; Alo: The proposed 
method. 

smallest among the state-of-the-art methods. Hence, the algorithm has the low complexity 
and converges very fast to the optimal solution, and it is highly potential to be applied in 
alternating least squares methods for factorization models. 

• Parallel and Distribution: The proposed algorithms are fully parallel and distributed 
on limited internal memory systems, which is crucial for big data when computing nodes 
having limited internal memory that cannot hold the whole dataset. 

• Implementation: The proposed algorithms are convenient to implement for hybrid 
multi-core distributed systems because this algorithm works on each individual instance and 
each latent feature. 

• Comparision: This is the first time that state-of-the-art algorithms in different research 
directions for NMF are compared together. 

The rest of paper is organized as follows: Section 2 discusses the background and related 
works of NMF; Section 3 mentions our proposed algorithm; Section 4 gives a complexity anal¬ 
ysis of our proposed algorithms; Section 5 experimentally compares our proposed algorithm 
with state-of-the-art algorithms for NMF among remarkable approaches; our conclusion is 
stated in Section 6. 


2 Background and Related Works 

2.1 B ackground 

Mathematically, NMF in Frobenius norm is defined as follows: 

Definition 1 [NMF]: Given a dataset consisting of m vectors in a n-dimension space 
V = [Vi,V 2 , G where each vector presents a data instance. NMF seeks to 
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decompose V into a product of two nonnegative factorizing matrices G and F, where G = 
[Gi, ...,Gr] € and F = [Fi,Fm] € are the latent component matrix and 

the coefficient matrix respectively, V ~ GF, in which the guality of approximation can he 
guaranteed by the objective function in Frobenius norm: D{y\\GF) = \\V — GFII2. 

Although NMF is a non-convex problem, optimizing each factor matrix when fixing the 
other one is a convex problem. In other words, F can be traced when G is fixed, and 
vice versa. Furthermore, F and G have different roles although they are symmetric in the 
objective function. G are latent components to represent data instances V by coefficients 
F. Hence, NMF can be considered as a latent factor model of latent components G, and 
learning this model is equivalent to find out latent components G. Therefore, in this paper, 
we propose an accelerated parallel and distributed algorithm to learn NMF models G for 
large datasets. 

2.2 Related works 

NMF algorithms can be divided into two groups: the greedy algorithms and the iterative 
multiplicative update algorithms. The greedy algorithms [TKWBll] are often based on geo¬ 
metric interpret-ability, and they can be extremely fast to deal with large datasets. However, 
it is hard to trade off complexity, optimality, loss information and sparseness. The iterative 
multiplicative update algorithms such as “two-block coordinate descent” often consist of two 
steps, each of them fixes one of two matrices to replace the other matrix for obtaining the 
convergence of the objective function. There are numerous studies on these algorithms, see 
Table 1, because NMF is nonconvex, though two steps corresponding to two non-negative 
least square (NNLS) sub-problems are convex [GTLY12, KHP14]. In addition, various con¬ 
straints and optimization strategies have been used to trade off the convexity, information 
loss, complexity, sparsity, and numerical instability. 

Based on the optimization updating strategy, these iterative multiplicative update algo¬ 
rithms can be further divided into three sub-groups: 

• Inexact Block Coordinate Descent: The algorithms’ common characteristic is their 
usage of gradient methods to seek an approximate solution for NNLS problems, which is 
neither optimal nor fulfilling of fast approximations and accelerated conditions. Lee et al. 
[LS^99] proposed the (basic) NMF problem and simple multiplicative updating rule (MUR) 
algorithm using first-order gradient method to learn the part-based representation. Seung at 
al. [SLOl] concerned rescaling gradient factors with carefully selected learning rate to achieve 
a faster convergence rate. Subsequently, Lin [Lin07a] modified MUR, which is theoretically 
proved getting a stationary point (a local minimum optimization). However, that algorithm 
cannot improve the convergence rate. Berry et al. [BBL+07] projected nonnegative least 
square (PNLS) solutions into nonnegative quadratic space by setting negative entries in the 
matrices to zero. Although this algorithm does not guarantee the convergence, it is widely 
applied in real applications. In addition, Bonettini et al. [Bonll] used line search based on 
Amijo rule to obtain better solutions for matrices. Theoretically, this method can achieve 
optimal solutions for factor matrices as exact block coordinate descent group, but it very 
slowly tends to stationary points because the line search is time-consuming. 
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• Exact Block Coordinate Descent: In contrast to the first sub-group, the common charac¬ 
teristic in this group is obtaining optimal solutions for two NNLS problems in each iteration. 
Zdunek et al. [ZC06] employed second-order quasi-Newton method with inverse of Hessian 
matrix to estimate the step size, aiming to a faster convergence than projected methods. 
However, this algorithm may be slow and non-stable because of the line search. Subsequently, 
Kim et al. [KSD07] used rank-one to Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm 
to approximate the inverse of Hessian matrix. Furthermore, Ghih-Jen Lin [Lin07b] proposed 
several algorithms based on projected gradient methods and exact line search. Theoretically, 
this method can obtain more accurate solutions, however it is time-consuming because of 
exact line search and the number of iterations increased by the large number of variables. 
Moreover, Kim et al. [KPOSa, KPOSb] proposed two active-set methods based on Karush- 
Kuhn-Tucker (KKT) conditions, in which the variables are divided into two sets: a free 
set and an active set. Only the free set contains variables that can optimize the objective 
functions. Removing the number of redundant variables makes their algorithms improve 
the convergence rate significantly. However, the method still has heavy computation for 
large-scale problems. 

• Accelerated Block Coordinate Descent: The accelerated methods use fast solution ap¬ 
proximations satisfying accelerated conditions to reduce the complexity and to keep fast 
convergence. The accelerated conditions are different constraints in different methods to 
guarantee convergence to the optimal solution in comparison with the initial value. These 
accelerated methods are developed due to the limitation of inexact methods having slow 
convergence, and exact methods having high complexity in each iteration. Particularly, for 
inexact methods, they have slow convergence because of the high complexity of solution 
approximations in each iteration or a large number of iterations that leads to the highly 
expensive computation between two sequential iterations. Furthermore, the exact methods 
have high complexity in each iteration, however obtaining optimal solutions in every itera¬ 
tion is controversial because it can lead to zig-zag problems when optimizing a non-convex 
function of two independent sets of variables. 

Firstly, Hsieh et al. [HDll] proposed a fast coordinate descent method with the best 
variable selection to reduce the objective function. The algorithm iteratively selects variables 
to update the approximate solution until the accelerated stopping condition maXijDf^ < epinit 
satisfied, where Df- is the reduction of the objective function based on the variable Gjj, and 
Pinit is the maximum initial reduction over the matrix G. Although the greedy update method 
does not have guaranteed convergence, it has the fast convergence speed in many reports. 

Subsequently, Gillis and Glineur [GG12] proposed a number of accelerated algorithms 
using fast approximation by fixing all variables but excepting a single column of factor matri¬ 
ces. This framework improved significantly the effectiveness of multiplicative updates [LSOl], 
hierarchical alternating least squares (HALS) algorithms [GZA07] and projected gradients 
[Lin07b]. These algorithms achieve the accelerated condition in each iteration such as that 
||(^(fc,z+i) _(^(fc, 0||2 ^ is the stopping condition when optimizing the objec¬ 

tive function on G if fixing F. Although these greedy algorithms does not have guaranteed 
convergence, their results are highly competitive with the inexact and exact methods. 
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More recently, Guan et al. [GTLY12] employed Nesterov’s optimal methods to optimize 
NNLS with fast convergence rate 0(1/k'^) to achieve the accelerated convergence condition 
Wqf^ —II2 ^ —lls- Although Guan et a/.’s method [GTLY12] has a fast convergence 

rate 0{l/k‘^), it has several drawbacks such as working on the whole factor matrices, and 
less flexibility for regularized NMF variants. Furthermore, this approach does not consider 
the issues of parallel and distribution, and they require numerous iterations to satisfy the 
accelerated condition because the step size is limited by where L is Lipschitz constant. 

To deal with the above issues of accelerated methods, in next section, we propose an 
accelerated parallel and distributed algorithm for NMF and its regularized Li L 2 variants 
with linear convergence in optimizing each factor matrix when hxing the other factor one. 

3 Proposed Algorithm 

To read easily, this section hierarchically presents our proposed algorithm. First, an iter¬ 
ative multiplicative update accelerated algorithm is introduced. Then, a transformational 
technique fully decomposes the objective functions of NMF into basic computation units 
as nonnegative quadratic programming (NQP) problems. After that, a modihed version 
of the algorithm is proposed to deal with the issues of parallel and distributed systems. 
Subsequently, a combinational method of an anti-lopsided algorithm and a fast coordinate 
descent algorithm is developed to effectively solve NQP problems. Finally, extensions for 
L 1 L 2 regularized NMF is discussed. 

3.1 Iterative multiplicative update accelerated algorithm 

For solving NMF, we employ an iterative multiplicative update accelerated algorithm, like 
expectation-maximization (EM) algorithm, presented in Algorithm 1. This algorithm con¬ 
sists of two main steps: one for hnding is updated F in the iteration) when hxing G 

and the other for hnding G~^ when hxing F. In the hrst step called E-step, we hnd F~^, each 
column of which F^ is the new representation of a data instance Vi in the new space of latent 
components G. Meanwhile, the other one, called M-step, learns new latent components. 


Algorithm 1: Iterative Multiplicative Update Accelerated Algorithm 
Input; Data matrix V = {U}^i G and r. 

Output; Latent components G = {Gk}l=i- 

1 begin 

2 Randomize r nonnegative latent components G ; 

3 repeat 

4 E-step: Fixing G to hnd F~^ such that the accelerated condition is satished; 

5 M-step: Fixing F to hnd G^ such that the accelerated condition is satished; 

6 until Convergence condition is satisfied', 
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3.2 Full decomposition for NMF 

This section discusses decomposing the objective function of NMF into non-negative quadratic 
programming (NQP) problems, which aims to fully parallelize and distribute the NMF com¬ 
putation. Particularly, in Algorithm 1, the E-step is to find new coordinates of data in¬ 
stances in the space of latent components G by minimizing J{V\\GF) = \\V — GEIH = 
W^i ~ Hence, minimizing J{V\\GF) is equivalent to independently minimizing 

\\Vi — GFilll for each instance i since G is hxed. Similarly, the M-step is also equivalent to 
independently minimizing \ \Vj' — for each feature j, where F is fixed. Hence, the 

basic computation units are nonnegative least-squares (NNLS) problems [LH74]. 

For large datasets n, m ^ r, we equivalently turn these problems into nonnegative 
quadratic programmings (NQP); 


equivalent to 


minimize - Ax — 6 2 

subject 

to X ^ 0 G i?’’ 

where 

A e b e Rl 

minimize 

X 

fix) = -x^Hx -|- h^x 

subject to 

X FO. 

where 

H = A^A, h = -A^b 


( 1 ) 


( 2 ) 


Hence, finding new coefficients F^ and new latent components G^ can be fully paralleled 
and distributed into basic computation units as solving NQP problems. 

3.3 Parallel and distributed algorithm using limited internal mem¬ 
ory 

In this section, we design a parallel and distributed algorithm using limited internal memory 
for learning NMF model G, see Fig. 1, which is a modified version of Algorithm 1. 

For large datasets, the computation can be untimely performed in a single process, so 
parallel and distributed algorithm environments are employed to speed up the computation. 
For parallel and distributed systems, we often face two major issues: dependency of com¬ 
putation units and limited internal memory computing nodes. In particular, computation 
units must be independently conducted as much as possible, since any dependency of com¬ 
puting elements will increase the complexity of implementation and the delay of data transfer 
over the network that reduces the performance of system. Furthermore, for these parallel 
distributed systems, computation units are executed on computing nodes within a limited 
internal memory. In addition, accessing external memory will increase the complexity and 
reduce the performance. 

For our proposed approach, the computation can be fully paralleled and distributed, and 
use limited internal memory in computing nodes because the objective function is properly 


Algorithm 2: Parallel and Distributed Algorithm 


Input: Data matrix V = £ -R+”^ and r. 

Output: Latent components G = {gkYk=i- 

1 begin 

2 Randomize r nonnegative latent components G G ; 

3 repeat 

4 Y = /*Y = FV^*/] 

5 LT = 0 G ^ = FF^ /*\ 

6 g = GG^] 

7 maxStop = 0; 

8 /*Parallel and distributed*/ 

9 for i = 1 to m do 

10 /*call Algorithm 3*/; 

11 Fi = Minimizing {x'^Qx/2 — VYG'^x); 

xeR’^to 

12 Y = Y + FiVY] 

13 IH = H + FiFl- 

14 /*Parallel and distributed*/ 

15 for j = 1 to n do 

16 /*call Algorithm 3*/ ; 

17 Gj = Minimizing {x'^Hx/2 
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Yjx) 


until Convergence condition is satisfied] 



Figure 1: Distributed System Diagram for NMF 
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decomposed to NQP problems. Particularly, Algorithm 2 presents a modified version of 
iterative multiplicative update algorithms, in which computation units are fully paralleled 
and distributed. In addition, Q = GG^ is precomputed to reduce the complexity, and finding 
new coefficients Fi can be independently computed and distributed. Remarkably, the most 
heavy computation oiY = FV'^ and H = FF'^ is divided into computing FiV^ and FiF^ 
to be parallel and distributed. 

Particularly, the distributed system using MapReduce is described in Fig. 1. In this 
computing model, data instances and the instance projection are parallel and distributed 
over the Map nodes. The Reduce nodes sum up the results FiFf and FiV^ of the Map 
nodes. Subsequently, in the M-step, the results and FV^ are employed to compute 

latent components G. This M-step computation can be conducted by a single machine 
or a distributed system, which depends on the dimension of problem because the time to 
distribute this computation over the network is usually considerable. 

In comparison with the previous algorithms, this computing model is much more effective 
than the previous models [GNHSll, LYF+IO, SLRIO] by the following reasons: 

• The necessary memory used in computing nodes is 0{size{G,Y, H)) = 0{r{r -|- n)). 
The necessary memory used in the controlled node is 0{size{G, Y, H, Q)) = 0{r{r + n)). In 
practice, approximate solutions of NQP problems should be cached in hard disks in order to 
increase accuracy and reduce the number of iterations. 

• At each distributed iteration, the computation is fully decomposed into basic computa¬ 
tions units, which enhances the convergence speed to the optimal solution because the size 
of optimization is significantly reduced. Furthermore, the expensive computation FR^ and 
FF^ is fully parallelized and distributed over the computing nodes. 

• The computational model is conveniently implemented because computing NMF model 
is divided into basic computation units as NQP problems that are independently solved, and 
the optimization is carried out on vectors instead of matrices. 

In the next section, we propose a novel algorithm. Algorithm 3, to solve approximately 
NQP problems, which is robust and effective because it only uses the first derivative and 
does not consider the ill-condition of matrix inverse. 

3.4 Fast algorithm for nonnegative quadratic programming 

In this section, we briefly review the literature before proposing the novel algorithm to solve 
NQP Problem 2 for real large-scale NMF applications. 

Regarding algorithms for NNLS and its equivalent problem NQP, numerous algorithms 
are proposed to deal with high dimension [CP09]. Generally, methods for solving NNLS can 
divided into two groups: active-set and iterative methods [GP09]. Active-set methods are 
traditional to solve accurately [BDJ97, LH74]. However, they require heavy computation 
in repeatedly computing A)~^ with different set of passive variables. Hence, iterative 
methods that can handle multiple active constraints in each iteration have more poten¬ 
tial for fast NMF algorithms [GP09, KSD06, KDD13]. Hence, iterative methods can deal 
with more large-scale problems. Among the fast iterative methods, the coordinate descent 
method [FHN05] has fast approximation, but has the zip-zag problem when the solution 
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requires high accuracy. In addition, accelerated methods [Nes83] has a fast convergence 
OiXjk^') [GTLY12], which only require the first order derivative. However, one major dis¬ 
advantage of the methods is that they require a big number of iterations because their step 
size is limited by ^ that can be very small for large-scale problems; where M is Lipschitz 
constant. More recently, the anti-lopsided algorithm [NH15] re-scale variables to obtain a 
linear convergence in the sub-space of passive variables. Unfortunately, the passive variables 
are unknown in advance, so several iterations are required to determine them. In addition, 
the complexity of each iteration is considerable about 0{r‘^). 

Therefore, we propose a combinational algorithm of the anti-lopsided algorithm [NH15] 
and the greedy coordinate block descent algorithm [HDll] to reduce the number of iterations 
as well as complexity. Particularly, the proposed algorithm. Algorithm 3, contains two main 
steps: The first step, from Line 4 to Line 7, rescales variables to avoid rescaling problems of 
the first order methods by replacing y = x. * diag{H), we have: 

f(x) = ]^x^Hx + h^x = ^y'^Qy + q^y (3) 

where Q = , ^ and q = , ^ such that %r- = Qa = = 1 for Vi. 

By the way, the rate of change of a quantity through variables equals to a constant and 
the exact line search can converge at an exponential rate of (1 — [NH15] in the sub¬ 

space of passive variables. The passive variables are variables belongs the set P = {xi\xi > 
0 or Vfi < 0} that changes through iterations. 

The second step contains a loop of iterations, from Line 9 to Line 21, each of which is 
clearly divided into two parts: one part from Line 11 to Line 14 inherited from the anti¬ 
lopsided algorithm [NH15] and the other from Line 16 to Line 20 based on the fast coordinate 
descent algorithm [HDll]. The anti-lopsided algorithm guarantees the linear convergence 
(1 — (IIQIIi — ■'") fo ^he sub-space of passive variables to avoid the zip-zag problem of 

the fast coordinate descent algorithm, while the coordinate block descent algorithm speeds 
up the convergence to the final optimal set of passive variables. In addition, the complexity 
of each part is still kept in 0{r‘^). As a result, the proposed algorithm will utilize advantages 
of both algorithms to attain a fast convergence, while retaining the same low complexity 
0{r‘^) of each iteration. 

To comprehend the proposed algorithm’s effectiveness, we consider optimizing Function 4: 

X + [-80 - lOOjx (4) 

The exact search gradient algorithm, from Line 11 to Line 14, starting with xq = [200 20]^ 
performs 59 iterations to reach the optimal solution, see Fig. 2. However, the proposed 
algorithm only needs 1 iterations to reach the optimal solution, see Fig. 3 because we optimize 
Function 5 instead of Function 4; where Function 5 is equivalently obtained by applying the 
steps from Line 11 to Line 14. The exact search gradient algorithm becomes much faster 
because the shape of Function 5 become more sphere, and its derivative is more effective to 
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optimize the objective function. 
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0.1 
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(5) 


Algorithm 3: Fast Combinational Algorithm for NQP 


Input; H G and h G and Xq 
Output; X minimizing + h'^x 

subject to; x ^ 0 

1 begin 


6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 


/*Having a variable maxStop = 0 for each thread of computation */ 

/*Re-scaling variables*/; 

0 = ^ 

^ diag(H)diag{H)'^ ' 

^ ^/diag{H) ’ 

/*Solving NQP; minimizing/(x) = |x^Qx + g^x*/; 

X = Xq. * \/diag{H); 

V/ = Qx + q; 

repeat 

/*Exact Line Search*/; 

V/ = V/[x > 0 or V/ < 0]; 

a = argmina/(xfc - oV/) = 

Xk = [xk-i -aVf] + ; 

^fk = ^fk + Q{Xk - Xk-l)\ 

/*Block Coordinate Descent*/; 
for t=l to n do 

Axi = max(0, [x^]* - - [xk]i Vi; 

p = argmaxi|/(xfc) - f{xk + Ax*)]; 

V/fc = V/fc + QpAxp; 

A Axp, 

until (||/fc||i < e||/o||^) or {\\fk\\l < maxStop); 
maxStop = max{maxStop, H/fcHi); 


iiv/iii 


return 




^/diag{H) 


Moreover, Algorithm 3 only attains approximate solutions because achieving the optimal 
solution is controversial for the reasons that its computation is expensive and it can leads 
to the zig-zag problem in optimizing a non-convex function. In addition, it is necessary to 
control and balance the quality of the convergence to the optimal solution. Hence, we employ 
an accelerated condition (||/fc ||2 < ^ll/olli) to regulate the quality of the convergence to the 
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Figure 2: 59 optimizing steps in iterative exact line search method using the first 
order derivative for the function 4 starting at xq = [200 20]^ 



Figure 3: 1 optimizing steps in iterative exact line search method using the first 
order derivative for the function 5 starting at yo = xq\/ diag{H) 
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optimal solutions of the NQP problems in comparison with initial values and a fast-break 
condition (H/fclH < max Stop) to balance the quality of the convergence among variables in 
each thread of the computation. As a result, the objective function converges faster through 
iterations; and the complexity and average of the iteration number are reduced significantly. 

3.5 Extensions for Li L 2 regularized NMF 

In this section, we consider solutions for Li L 2 regularized NMF variants to control the 
quality of NMF. Li regularized NMF [Hoy02] aims to achieve sparse solutions in optimiza¬ 
tion problems. Usually, only the coefficient matrix F is penalized to control its sparsity. 
Meanwhile, concerning L 2 regularized NMF, the penalty terms of F and G are added to 
control smoothness of solutions in NMF [PPP06]. Fortunately, the objective functions of Li 
L 2 regularized NMF can be turned into NQP problems, of which solutions are completely 
similar to the general NMF. Particularly, in the most general variant, the objective function 
J{X\\GF) is formulated by: 

\\X — GF\\l -I- /ii||F||i -1- /?i||G||i -I- /i2||F’||2 + /^2||G'||2 (6) 

where ||.||i is the Li-norm, ||.||2 is the L 2 -norm, and p-i,/i 2 , A,/52 are regularized parameters 
that tradeoff the sparsity and the smoothness of the information loss. Obviously, both the 
E-step and the M-step need to solve the same NNLS problems when one of the two matrices 
is fixed. For example, in a E-step, we can minimize the objective function by independently 
solving NQP problems when fixing G: 

J{X\\GF) = - GF\\l + p,\\F\\, + p2\\F\\1 + C 

M 

= - GFr^Wl + + pi2FllF^) + C 

m=l 

M 

m=l 

where Q = G^G + 2pil, = —X'^G'^ -|- P 2 ^^ and C is a constant. 

This transformation from minimizing the objective functions into solving NQP problems 
independently is comprehensive to understand and simplify the variants of NMF problems as 
much as possible. As a result, we can conveniently implement NMF and its Li L 2 regularized 
variants in parallel distributed systems as in sub-section 3.3. 

In comparison with the previous algorithms that optimizing the objective function works 
on the whole of matrices, this approach decomposing the objective function is easier to 
parallelize and distribute the computation. Additionally, it is faster to reach the solutions 
because it only performs on a smaller set of variables. 
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4 Theoretical Analysis 

In this section, we investigate the convergence of Algorithm 3 and the complexity of Algo¬ 
rithm 2 using Algorithm 3 


4.1 Convergence 

In this section, we only consider the convergence rate of Algorithm 3 by the general NMF for 
the two following reasons. Firstly, Li regularized coefficients do not affect on the complexity. 
Secondly, L 2 regularized coefficients are often small, and they change Lipschitz constants m 
and M by adding a small positive value, where m and M are positive Lipschitz constants 
of strongly convex function f{x) satisfying ml ^ A MI and / is the identity matrix. 
Hence, L 2 regularized coefficients slightly change the convergence rate because it depends 

Based on [NH15], consider the complexity of Algorithm 3, we have: 

Theorem 4.1. Algorithm 3 linearly converges at the rate of 0{1 — in sub-space 

of passive variables, where y/r < IIQII 2 < r, r is the dimension of solutions or the number of 
latent factors, and k is the number of iterations. 

Proof. From [NH15], we have: 

Remark 1: After {k -|- 1) iterations, — f* < (1 — — /*), where 

ml P V^/ ^ MI, f* is the minimum value of f{x), and f{x) is a strongly convex function 
of the passive variables. 

We have V^/ = Q, and 

x^Ix < QijXiXj = x^Qx since x > 0, Q > 0, and Qa = 1. ^ I ^ Q. 

Moreover, based on Cauchy-Schwarz inequality, we have: 


2=1 j = l 2=1 j = l 2 = 1 j = l 


i=i j=i 




muY, 


X. 


. 2\2 


i=l 


^x'^Qx <\\Q\\2 X^Ix (Vx) ^QP\\Q\\2l 


Finally, ^/f = VEI=i Ql < IIQII 2 = E5=i Qlj < = r since -1 < Qij = 

cos{Hi, Hj) < 1. Therefore, we have: 

Remark 2: I < V^/ = Q Y HQIb-f; where y/r < IIQII2 < r. 

From Remark 2 setting m = 1 and M = ||Q ||2 < r, and Remark 1, we have Theorem 4.2. 


Actually, the exact line search step, from Line 11 to Line 14 in Algorithm 3, guarantees 
linear convergence of 0{1 — in the sub-space of passive variables. However, the set 
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of passive variables changes through iterations. Hence, we employ the fast block coordinate 
descent steps, from Line 16 to Line 20 in Algorithm 3, that rapidly restrict the domain 
of solution to converge to the hnal optimal sub-space of passive variables of the solution. 
Therefore, the proposed algorithm linearly converges and requires very few iterations. 

4.2 Complexity 

In this section, we analyze the complexity of Algorithm 2 using Algorithm 3 to solve NQP 
problems. If we assume that the complexity for each iteration contains 0{nr‘^) in computing 
Q = G^G , 0{mnr) in computing Y = VF'^, 0{mr‘^) in computing H = FF'^, 0{kmr^) 
in computing F and 0{knr‘^) in computing G, where k is the number of iterations, then we 
have the following Lemma 3: 

Theorem 4.2. The complexity of each iteration in Algorithm 2 using Algorithm 3 to solve 
NQP problems is 0{{m + n)r^ + mnr -|- k{m + n)r^). In addition, it is 0{{m + n)r^ + 
rS{mn) -|- k{m + n)r‘^) for sparse data, where S{mn) is the number of non-zero elements in 
data matrix V. 

Theorem 4.2 is signihcant for big data, because the data is usually big and sparse. In other 
words, mn is actually large, but S{mn) is small; so mn S> {m + n)r‘^+ rS{mn) + k{m + n)r‘^. 
Hence, in experimental evaluation Section 5, we prove that our algorithm can run on large 
high-dimension sparse datasets such as Nytimes for an acceptable time. In that dataset, 
mnr 3> rS{mn) 3> (m -|- n)r^, so the running time T{m,n,r) ~ rS{mn) since r. 

Moreover, Table 2 shows a comparison of the complexity in an iteration of our proposed 
algorithms (Alo) with other state-of-the-art algorithms’ in the literature: Multiplicative Up¬ 
date Rule (MUR) [LSOl], Projected Nonnegative Least Squares (PrN) [BBL+07], Projected 
Gradient (PrG) [LinOTb], Projected Quasi-Newton (PQN) [ZG06], Active Set (AcS) [KPOSa], 
Block Principal Pivoting (BIP) [KPOSb], Accelerated Hierarchical Alternating Least Squares 
(AcH) , Fast Goordinate Descent Methods with Variable Selection (FGD) [HD 11], and Nes¬ 
terov’s Optimal Gradient Method (Nev) [GTLY12]. It can be seen that the complexity of 
our proposed algorithm is highly comparable with that of other algorithms, and the speed 
of algorithms depend on the number of iterations. In the experimental evaluation, we will 
show that the iteration number of our algorithm is highly competitive with other algorithms’. 
Remarkably, moreover, our proposed algorithm has the following properties that other algo¬ 
rithms has yet considered: 

• Exploit the sparseness of datasets, 

• Runnable for big datasets in limited internal memory systems, 

• Gonvenient to implement in fully paralleled and distributed systems. 
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Table 2: Complexity of an iteration in NMF solvers 


Solver 

Complexity (O) 

MUR [LSOl] 

mnr + [m + n)r‘^ 

PrN [BBL+07] 

mnr + [m + n)r‘^ -|- 

PrG [Lin07b] 

(m -|- n)r^ -|- rmn -|- kt{m + n)r‘^ 

PQN [ZC06] 

k{mnr + -|- n^r^) 

BIP [KP08b] 

[m -|- n)r^ -|- mnr -|- k{m -|- n)r^ 

AcS [KP08a] 

(m -|- n)r‘^ + rmn + k{m + n)r‘^ 

FCD [HDll] 

[m + n)r^ + rS{mn) + k{m + n)r‘^ 

AcH [GG12] 

[m + n)r‘^ + rS{mn) -|- k{m -|- n)r^ 

Nev [GTLY12] 

(m -|- n)r‘^ + mnr + k{m + n)r‘^ 

Alo 

{m + n)r‘^ + rS{mn) + k{m + n)r‘^ 


where m, n is the matrix size, r is the number of latent components, k is the average number of iter¬ 
ations, t is the average number of internal iterations, and S{mn) is the number of non-zero elements 
of data matrix V. To easily compare among the algorithms, we consider r update times for Algo¬ 
rithm FCD as one iteration because the complexity of one update is 0(r), while the complexity of 
one iteration in other accelerated algorithms is O(r^). 

5 Experimental evaluation 

In this section, we investigate the effectiveness of the proposed algorithm Alo by comparing 
it to 7 carefully selected state-of-the-art NMF solvers belongs to different approaches: 

• MUR; Multiplicative Update Rule [LS’''99], 

• PrG; Projected Gradient Methods [Lin07b], 

• BIP; Block Principal Pivoting method [KP08b], 

• AcS; Fast Active-set-like method [KP08a], 

• FCD: Fast Coordinate Descent methods with variable selection [HDll], 

• AcH; Accelerated Hierarchical Alternating Least Squares [GG12], 

• Nev: Nesterov’s optimal gradient method [GTLY12]. 

Test cases; In this experiment, we design two tests using four datasets shown in Table 
3. In the hrst test, 3 typical datasets with different sizes are used; Faces^, Digits^ and Tiny 

'^http: //cbcl .mit. edu/cbcl/software-datasets/FaceData.html 
^http; //yann. lecun.com/exdb/mnist/ 
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Table 3: Dataset Information 


Data-sets 

m 

n 

r 

Maxiter 

Faces 

6977 

361 

60 

300 

Digits 

6.10^ 

784 

80 

300 

Tiny Images 

5.10^ 

3,072 

100 

300 

Ny times 

3.10^ 

102,660 

100,...,200 

300 


Images For these tests, the algorithms are compared in terms of convergence, optimality, 
and average of the iteration number to investigate their performance and effectiveness. Addi¬ 
tionally, average of the the iteration number k for approximate solutions of the sub-problems 
as NNLS or NQP is to compare the complexity of algorithms. In the second test, a large 
dataset containing tf-idf values computed from the text dataset Nytimes^ is used to verify 
the performance and the feasibility of our parallel algorithms on sparse large datasets. 

Environment settings: To be fair in comparison, for the hrst test, the programs of 
compared algorithms are written in the same language Matlab 2013b, run by the same 
computer Mac Pro 8-Core Intel Xeon E5 3 GHz RAM 32 GB, and initialized by the same 
factor matrices Go and Fq. The maximum number of threads is set to 10 while keeping 2 
threads for other tasks in the operation system. For the second test, the proposed algorithm 
is written in Java programming language to utilize the data sparseness. 

Source code; The source codes of MUR, PrG, BIP, AcS, FCD, AcH, and Nev are 
downloaded from and For convenient comparison in the future, we publish all 

the source codes and datasets in 

5.1 Convergence 

In this experiment, we investigate the convergence of algorithms by information loss ^||A — 
GF||| in terms of time and the iteration number. In terms of time, see Fig. 4, the proposed 
algorithm Alo is remarkably faster than the other algorithms for the three different-size 
datasets: Faces, Digits and Tiny Images. Especially, for the largest dataset Tiny Images, 
the distinction between the proposed algorithm and the runner-up algorithm AcH is easily 
recognized. Furthermore, in terms of the iteration number, see Fig. 5, the proposed algorithm 
converges to the stationary point of solutions faster than the others. This observation is clear 
for large datasets as Digits and Tiny Images. The results are signihcant in learning NMF 
models for big data because the proposed algorithm not only converges faster but also uses 

^http://horatio.cs.nyu.edu/mit/tiny/data/index.html 

^https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/ 

^http://www.cs.toronto.edu/~dross/code/nnmf.m 
®https://github.com/kimjingu/nonnegfac-matlab 
^http://www.csie.ntu.edu.tw/~cjlin/ninf/ 

®http;//dl.dropboxusercontent.com/u/1609292/Acc_MU_HALS_PG.zip 
®https;//sites.google.com/site/nmfsolvers/ 

^^https;//bitbucket.org/[aaa-zzz]/alnmf 
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Faces 


Digits 


Tiny Images 



Figure 4: Objective function values ||1/ —GF|||/2 versus CPU seconds for datasets: 
Faces, Digits, and Tiny Images 


Faces Digits Tiny Images 




Figure 5: Objective function values | |U — GF\ ||/2 in terms of the iteration number 
for datasets: Faces, Digits, and Tiny Images 


a less number of iterations, and the time of reading and optimization through a big dataset 
is actually considerable. 

5.2 Optimality 

After more a decade of rapid development, numerous algorithms have been proposed for 
solving NMF as a fundamental problem in dimension reduction and learning representation. 
Currently, the difference of the hnal loss information \\V — fUi/Hl among the state-of-the- 
art methods is inconsiderable in comparison to the square of information ||U|||. However, 
the small difference represents the effectiveness of the optimization methods because NMF 
algorithms often slowly converge when the approximate solution is close to the optimal local 
solution. Hence, in Table 4, the hnal values of the objective function || |I/—hFFf| || investigate 
the optimality and the effectiveness of the optimization methods. Noticeably, Algorithm AcH 
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Table 4: Optimal Values of NMF solvers 


Dataset 


MUR 

PrG 

BIP 

AcS 

FGD 

AcH 

Nev 

Alo 

Faces (xlO®) 


3.142 

2.003 

1.975 

1.975 

1.983 

2.058 

2.003 

1.966 

Digits (x 10^°) 


4.659 

1.639 

1.641 

1.641 

1.644 

1.640 

1.646 

1.638 

Tiny Images ( 

X 

o 

o 

6.925 

3.483 

3.472 

3.472 

3.474 

3.484 

3.513 

3.468 



Table 5: 

Average of Iteration Number k 



Dataset 

MUR 

PrG 

BIP 

AcS 

FGD 

AcH 

Nev 

Alo 

Faces 

1.00 

321.12 

1116.96 

102.09 

1.54 

1.11 

29.21 

1.29 

Digits 

1.00 

36.70 

12503.75 

305.94 

1.00 

1.05 

23.36 

1.00 

Tiny Images 

1.00 

767.45 

12869.12 

1086.51 

1.38 

2.52 

29.32 

1.27 


fast converges over time and has a low average of the iteration number, but it has the optimal 
values much higher than the proposed algorithm because it uses a time-break technique to 
interrupt the optimization algorithm. In addition, the proposed algorithm achieves the best 
optimality for all three datasets. This result additionally represents the robustness of the 
proposed method, which is highly competitive with the state-of-the-art methods. 

5.3 Average of iteration number 

In this section, we investigate the complexity of the NMF solvers by average of the iteration 
number k = approximate solutions of sub-problems as NNLS or 

NQP because the complexity of algorithms mainly depends on this number, see Table 2. Ex¬ 
cept for the original algorithm MUR with one update having the worst result, the proposed 
algorithm Alo employs at least average of the iteration number, see Table 5, especially for 
large datasets. In addition, the proposed algorithm does not employ any tricks to timely 
interrupt before one of the stopping conditions is satisfied, while the highly competitive algo¬ 
rithm AcH uses. Therefore, this result clearly represents the fast convergence of Algorithm 3 
as it is verified by a large number of NQP problems. 

5.4 Running on large datasets 

In this section, we verify the feasibility of the proposed algorithm in learning NMF model 
for large datasets. Particularly, the proposed algorithm is implemented by Java program¬ 
ming language to exploit the data sparseness. Additionally, it runs on the large sparse text 
dataset Nytimes with different numbers of latent components, see Table 3. Interestingly, the 
proposed algorithm can run with hundreds of latent components by a single computer in an 
acceptable time. 

Fig. 6 shows the performance of our algorithm running on the large sparse dataset Ny¬ 
times. Remarkably, the proposed algorithm only uses about 1 iteration on average to sat- 
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number of components r 


number of components r 


iterators 


Figure 6: average of the iteration number /c, average of iteration time, and conver¬ 
gence of ^ in learning NMF model for the dataset Nytimes within the different 
numbers of latent components 


isfy the accelerated condition of approximate solutions. Furthermore, the average of it¬ 
eration time in learning NMF model linearly increases through the different numbers of 
latent components. This result totally hts the complexity analysis when rnm 3> rS{mn) 3> 
{m+n)r‘^ + k{m + n)r‘^, so the complexity T{m,n,r) ^ rS{mn) since r. Additionally, 

the objective function converges to the stationary point at about the iteration within 
the different numbers of latent components r, which is the same with the previous datasets. 

5.5 Regularized NMF extensions 

In this section, we investigate the convergence of algorithms for regularized NMF extensions 
on three datasets: Faces, Digits, and Tiny Images. Due to the lack of available codes and 
the Li L 2 generalization of the other algorithms, only three algorithms AcS, Nev and Alo 
are compared within two regularized cases: p 2 = 10“^ and 112 = ^2 = 10“^, see Fig. 7. In 
comparison with other algorithms for regularized NMF extensions, the proposed algorithm 
Alo converges much faster than algorithms AcS and Nev. 

6 Conclusion 

In summary, our work has two major contributions: 

Regarding nonnegative matrix factorization, we propose a flexible algorithm in an unihed 
framework for NMF and its Li L 2 regularized variants based on full decomposition and a fast 
combinational algorithm of the anti-lopsided algorithm [NH15] and the greedy coordinate 
block descent algorithm [HDll]. The proposed algorithm has linear convergence rate of 
0{\ — m. optimizing each matrix factor in the sub-space of passive variables when hxing 
the other matrix, where r is the number of latent components. The proposed algorithm is 
an advanced version of fast block coordinate descent methods and accelerated methods. In 
theory and practice, the proposed algorithm resolve some current major issues of NMF: fast 
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Faces 


Digits 


Tiny Images 





seconds 


Figure 7: Convergence of regnlarized NMF Extensions for algorithms AcS, Nev and 
Alo within two regnlarized cases: /i 2 = 10“^ and /U 2 = /?2 = 10“^ 


learning algorithm, data sparseness exploit-ability, and parallel distributed feasibility using 
limited internal memory. Furthermore, the proposed algorithm flexibly adapts with all the 
variants of Li L 2 NMF regularizations. 

In experimental comparative evaluation, our algorithm overcomes 7 of the most art-the- 
state algorithms in large datasets about three signihcant aspects of convergence, average of 
the iteration number and optimality. In addition, it can fully be parallelized and distributed 
because the computation using limited internal memory is decomposed into basic computa¬ 
tion units as NQP problems. Concerning the feasibility in real applications, the proposed 
algorithm exploits the data sparseness to learn the huge sparse dataset Nytinies in an ac¬ 
ceptable time by a single machine. Finally, the convergence of the proposed algorithm for 
LiL 2 regularized NMF variants is much faster than that of the existing algorithms. 

Concerning the optimization techniques for alternating least squares methods, we propose 
a fast algorithm. Algorithm 3 for NQP problems, which not only has a linear convergence in 
theory but also is verihed in practice about the three signihcant aspects by a large number 
of NQP problems conducted inside the NMF framework. Hence, we strongly believe that 
the algorithm can be effectively employed for alternating least square methods as the key 
problem in factorization methods. 
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